Journal of Cheminformatics
Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match Journal of Cheminformatics's content profile, based on 25 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.
Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.
Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.
Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.
Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com
Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.
Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a webserver for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms and are subsequently evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.
Wang, Y.; Rao, J.; Zhang, W.; Shi, Y.; Zeng, C.; Cui, R.; Wang, Y.; Xiong, J.; Li, X.; Zheng, M.
Accurate prediction of drug metabolites and enzyme selectivity is essential for rational drug design and safety assessment. However, existing computational approaches are often limited to specific enzyme families or reaction types, lacking the capacity to model enzyme-subtype specificity and prioritize major metabolites. Here, we present MetaReact, an end-to-end generalizable Transformer-based model that unifies the prediction of metabolic enzymes, metabolites, and sites of metabolism (SOM). By integrating structure-aware encoding ReactSeq, a chemistry reaction-based pretraining, MetaReact consistently outperforms state-of-the-art methods across multiple benchmarks under three settings: enzyme-agnostic, enzyme-completion, enzyme-conditioned. Notably, it achieves 60% Top-3 accuracy in identifying major metabolites and superior CYP450 enzyme-subtype prediction/SOM recognition. Case studies validate its applicability to complex natural products, synthetic cannabinoids, and clinical candidates, facilitating toxicity assessment and molecular optimization. This scalable, rule-free solution advances human metabolism modeling, with potential for computational pharmacokinetics and early drug discovery.
Broster, J. H.; Popovic, B.; Kondinskaia, D.; Deane, C. M.; Imrie, F.
Molecular docking aims to predict the binding conformation of a small molecule to its protein target. Recent work has proposed diffusion models for this task, from rigid-body docking that diffuses over ligand degrees of freedom to co-folding approaches that jointly generate protein structure and ligand pose. However, diffusion-based docking models have been shown to frequently produce physically implausible poses and fail to consistently recover key protein-ligand interactions. To address this, we introduce a reinforcement learning framework for training diffusion-based docking models directly on non-differentiable objectives. Fine-tuning DiffDock-Pocket for physical validity with our approach substantially increases the number of generated poses that are physically valid and interaction-preserving, with no increase in inference-time compute. Importantly, this comes without sacrificing structural accuracy; in fact, our approach increases the proportion of structures with near-native poses. These effects are most pronounced for protein targets that are dissimilar to the training data. Our fine-tuned DiffDock-Pocket model outperforms both classical docking algorithms and machine learning-based approaches on the PoseBusters set. Our results demonstrate that reinforcement learning can teach diffusion-based docking models to better respect physical constraints and recover key interactions, without the requirement to rely on inference-time corrections.
Rajbanshi, B.; Iqbal, K.; Guruacharya, A.
Assessing whether a preclinical drug candidate will work is not a prediction problem but a reasoning problem. The same numerical output warrants different interpretations depending on the target and therapeutic context. CNS drug development presents the most demanding instance of this reasoning problem. For example, a compound must cross the blood-brain barrier, resist efflux transport, and achieve adequate receptor occupancy at a dose that clears safety margins. The constraints interact with each other in a web that needs careful interpretation. Here, we show that Cirrina, an LLM agent coupled to eight mechanistic pharmacology tools, can reason across the input data to provide better decisions and a well-documented reasoning trace. The LLM agent reasons across multiple data tiers, from SMILES to animal PK/PD measurements, adjusting thresholds based on target-specific requirements. Validated against 181 CNS compounds, it achieved 68% accuracy, compared to 31% for a rule-based deterministic pipeline. In 103 discordant cases, the agent's reasoning was correct in 75% of instances compared to only 10% for deterministic pipelines. Cirrina provides a scalable, documented framework for preclinical decision-making, effectively identifying failure-prone candidates that generic thresholds overlook, and thereby reducing the chances of failure along the clinical development cycle.
Down, T.; Warowny, M.; Walker, A.; D'Ascenzo, L.; Lee, D.; Zhou, Z.; Cao, S.; Bainbridge, T. W.; Nicoludis, J. M.; Harris, S. F.; Mukhyala, K.
As computational tools and machine learning models for protein sciences continue to advance and proliferate, bench scientists face increasing technical challenges adopting these tools for specific applications such as drug discovery. Here we present GYDE (Guide Your Design and Engineering), an open-source, versatile, and web-based collaboration platform designed to make computational analyses of proteins and antibodies easily accessible to bench scientists. GYDE enables the exploration of sequence-structure-function relationships through a tightly integrated visual interface, offering researchers a comprehensive exploration of protein functional determinants either via real assay data or computational tools. GYDE's intuitive interface facilitates seamless access to cutting-edge AI models for protein and antibody structure prediction, design, and downstream analyses. The flexible and easy addition of new tools and models is facilitated by the use of the Slivka compute API. The platform supports saved sessions that enable researchers to easily share their findings with other users, fostering a more collaborative scientific community. GYDE is freely available for protein scientists in academia and industry to build drug discovery analytics platforms customized to their needs.
Tadiello, M.; Ludaic, M.; Viliuga, V.; Elofsson, A.
Motivation: AlphaFold has transformed structural biology with an unprecedented accuracy in modeling protein structures and their interactions with biomolecules, with AlphaFold3 (AF3) achieving state-of-the-art performance. However, AF3 and other methods often struggle to accurately predict the structure of protein complexes that lack strong co-evolutionary information, such as antibody-antigen (Ab-Ag) complexes. One of the fundamental issues is that AF3 often generates accurate predictions, but fails to reliably distinguish them from the much larger set of incorrect ones. Results: To address this, we propose ABAG-Rank, a deep neural network that provides an efficient and robust solution for model selection of Ab-Ag interactions from a pool of structural ensembles predicted with AlphaFold. Built on the permutation-invariant DeepSets architecture, ABAG-Rank can process variable-sized ensembles of structural decoys and is directly applicable to prediction settings in which the number of candidates may vary. We train a model on a redundancy-reduced set of all known antibody-antigen complexes and find that simple geometric descriptors, along with confidence scores from AlphaFold, provide rich information about interface quality without requiring intensive physics-based calculations. Our experiments demonstrate that ABAG-Rank significantly outperforms AF3 internal scoring and the ranking performance of existing deep learning baselines. Implementation: Source code can be found at https://github.com/tadteo/ABAG-Rank
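The DeepSets idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the ABAG-Rank implementation: the two per-decoy features (a hypothetical interface distance and a confidence score), the fixed toy encoder `phi`, and the readout are all assumptions chosen only to show why scores do not depend on the order or number of decoys.

```python
import math

def phi(decoy):
    """Toy per-decoy encoder (hypothetical features: interface distance, confidence)."""
    iface_dist, confidence = decoy
    return (math.tanh(0.1 * iface_dist), confidence, confidence * confidence)

def pool(embeddings):
    """Mean pooling over the set; math.fsum is exactly rounded, so order cannot matter."""
    n = len(embeddings)
    dims = len(embeddings[0])
    return tuple(math.fsum(e[d] for e in embeddings) / n for d in range(dims))

def score_decoys(decoys):
    """Score each decoy from its own embedding relative to the pooled set context."""
    embeddings = [phi(d) for d in decoys]
    context = pool(embeddings)
    # Toy readout: how far each decoy sits from the ensemble average.
    return [sum(e_i - c_i for e_i, c_i in zip(e, context)) for e in embeddings]

# Usage: the same decoy gets the same score regardless of input order.
decoys = [(4.0, 0.9), (7.5, 0.4), (5.2, 0.7), (9.1, 0.2)]
forward = dict(zip(decoys, score_decoys(decoys)))
backward = dict(zip(decoys[::-1], score_decoys(decoys[::-1])))
print(forward == backward)
```

Because the pooled context is a symmetric function of the set, the function accepts ensembles of any size, which is the property the abstract relies on for variable-sized pools of decoys.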
Poelmans, R.; Van Eynde, W.; Bruncsics, B.; Bruncsics, B.; Arany, A.; Moreau, Y.; Voet, A. R.
The development of machine learning models for protein-ligand interactions is fundamentally constrained by the quality and diversity of available structural data. Existing databases of protein-ligand complexes present researchers with an unsatisfying trade-off: carefully curated collections such as PDBBind and HiQBind offer high structural reliability but cover only a narrow slice of the Protein Data Bank (PDB), while large-scale resources like PLInder provide broad coverage at the expense of rigorous quality control. Here, we introduce CROWN (Curated Repository Of Well-resolved Non-covalent interactions), a machine learning-ready dataset that reconciles this tension by applying a comprehensive, fully automated preprocessing pipeline to the PLInder database. Starting from 649,915 protein-ligand interaction systems, CROWN applies a series of interleaved quality filters and processing stages addressing crystallographic resolution, ligand identity, pocket completeness, structural repair, interaction quality, and protonation at physiological pH. A distinguishing feature of the pipeline is a final constrained energy minimisation step using custom flat-bottomed restraints, which balances crystallographic evidence with relaxation of intramolecular strain. This step, absent from existing protein-ligand datasets, produces structurally uniform complexes by reconciling the heterogeneous refinement practices of different crystallographers and structure determination protocols, without distorting the experimentally observed binding geometry. The resulting dataset of 153,005 complexes represents a roughly four-fold increase in protein and species diversity over PDBBind and HiQBind, while maintaining rigorous structural standards.
Importantly, CROWN adopts a geometry-centric design philosophy that treats the 3D arrangement of atoms at the binding interface as a self-consistent source of information, rather than relying on externally measured binding affinities that cover only a fraction of known structures and introduce well-documented biases. We anticipate that CROWN will serve as a broadly useful resource for training generative models of protein-ligand binding poses, developing scoring functions, and benchmarking interaction prediction methods.
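The flat-bottomed restraints mentioned in this abstract are described as custom, and their exact form is not given here. As a hedged illustration, the sketch below uses the common textbook form: zero penalty while an atom stays within a tolerance of its reference position, and a harmonic penalty beyond it. The function name, parameters, and values are assumptions.

```python
def flat_bottom_energy(distance, reference, tolerance, force_constant):
    """Restraint energy that is zero inside the tolerance and harmonic outside it.

    distance       -- current value of the restrained coordinate
    reference      -- value observed in the crystal structure
    tolerance      -- half-width of the penalty-free region (assumed parameter)
    force_constant -- stiffness of the harmonic wall (assumed parameter)
    """
    excess = abs(distance - reference) - tolerance
    if excess <= 0.0:
        return 0.0  # inside the flat bottom: free to relax strain
    return 0.5 * force_constant * excess ** 2

# Usage: small deviations cost nothing, large ones are pulled back.
print(flat_bottom_energy(3.4, 3.0, 0.5, 10.0))  # inside the flat region
print(flat_bottom_energy(4.0, 3.0, 0.5, 10.0))  # 0.5 beyond the tolerance
```

The flat region is what lets minimisation remove local strain without fighting the restraint, while the wall keeps the result close to the experimentally observed geometry.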
Poelmans, R.; Bruncsics, B.; Arany, A.; Van Eynde, W.; Shemy, A.; Moreau, Y.; Voet, A. R.
Knowledge-based potentials (KBPs) have long been used to score protein-ligand interactions, yet existing formulations remain isotropic, capturing only distance dependencies and neglecting the directional preferences that govern molecular recognition. Here, we introduce Direction-Enhanced Scoring POTentials (DESPOT), an anisotropic knowledge-based framework that unifies pose scoring and binding-site characterisation within a single probabilistic model. The new probabilistic formulation used in DESPOT naturally supports directional modelling through atom type-specific local reference frames and symmetry-aware geometric discretisation. It also supports steric exclusion, encoded as a dedicated void state that explicitly captures the probability that a spatial bin remains unoccupied. The anisotropic interaction profiles learned by DESPOT reveal systematic directional preferences for interactions such as hydrogen bonds, aromatic interactions, and halogen bonds, that extend beyond idealised geometric models. Evaluation on the CASF-2016 benchmark shows that DESPOT substantially outperforms isotropic KBPs in all pose-discrimination and virtual screening tasks (p << 0.0001 for all enrichment factors), with the largest gains arising from its ability to penalise geometrically implausible poses. Constrained energy minimisation of training structures proves strongly beneficial for the derivation of KBPs, while our train-test leakage analysis reveals that overfitting is an underestimated and understudied issue for KBPs. DESPOT provides a data-driven framework for direction-aware modelling of protein-ligand interactions, with applications in pose scoring, binding-site characterisation, and structure-based design.
Joshi, S.; Sowdhamini, R.
Motivation: Characterizing atomic-level stability and cooperative interaction networks is essential for understanding protein function and evolution. However, existing tools often lack the precision to integrate detailed physicochemical energies with higher-order graph-theoretic analyses. Results: We present HORI-EN, an updated implementation of the HORI framework, featuring hybrid energetic scoring (Physicochemical + Knowledge-Based) and a Normalized Interaction Score (NIS) based on cumulative distribution functions. HORI-EN identifies higher-order cliques of interacting residues, revealing cooperative stabilization networks. Validation on the SKEMPI v2 dataset demonstrates that HORI-EN shows discriminative performance in identifying mutational hotspots, achieving an ROC-AUC of 0.780 on the full dataset and 0.844 on a clean benchmark. Enrichment analysis indicates a 3.1-fold increase in precision for the top 1% of predictions. Furthermore, analysis of the residue interaction network recovers 77.4% of non-contacting hotspots by identifying one-hop bridging interactions to the partner chain. Beyond hotspot prediction, HORI-EN distinguishes native structures from decoys and captures conserved energetic signatures in evolutionary case studies of serine proteases and lipases. Availability and Implementation: The web server is freely available at https://caps.ncbs.res.in/HORI-EN and source code is available at https://github.com/thesixeyedknight/HoriPy. Contact: mini@ncbs.res.in
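A fold increase in precision for the top fraction of predictions is usually computed as precision among the top-ranked items divided by the overall positive rate. The sketch below assumes that definition, which the abstract does not spell out; the function names and toy data are illustrative only.

```python
def precision_at_top(scores, labels, fraction):
    """Precision among the top `fraction` of items ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))  # keep at least one item
    return sum(label for _, label in ranked[:k]) / k

def fold_enrichment(scores, labels, fraction):
    """Top-fraction precision relative to the base rate of positives (assumed baseline)."""
    base_rate = sum(labels) / len(labels)  # requires at least one positive label
    return precision_at_top(scores, labels, fraction) / base_rate

# Usage with toy data: 3 hotspots among 10 residues, top 20% inspected.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(fold_enrichment(scores, labels, 0.2))
```

An enrichment of 1.0 would mean the ranking is no better than picking residues at random.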
Duarte, S. A.; Mehdiabadi, M.; Bugnon, L. A.; Aspromonte, M. C.; Piovesan, D.; Milone, D. H.; Tosatto, S.; Stegmayer, G.
Intrinsically disordered proteins (IDPs) play an important role in a wide range of biological functions and are linked to several diseases. Due to technical difficulties and the high cost of experimental determination of disorder in proteins, combined with the exponential increase of unannotated protein sequences, the development of computational methods for disorder prediction has become an active area of research in the last few decades. In this work, we present emb2dis, a deep learning model that uses protein language models (pLMs) to predict disorder from sequence. The emb2dis tool is a pre-trained model that receives as input a protein sequence, calculates its pLM embedding and passes it to a deep learning model. In contrast to existing approaches, emb2dis integrates informative sequence representations with a novel architecture that combines residual networks (ResNets) and dilated convolutions. This design effectively enlarges the receptive field of the convolution operation, enabling the model to better capture an extended context of each amino acid. At the output, emb2dis assigns a disorder propensity score to each residue in the sequence. The model was evaluated on datasets from the latest CAID3 blind benchmark for disorder prediction, where it achieved first place in the Disorder-PDB category, exhibiting strong performance with high AUC and Fmax scores. Additionally, it ranked among the top ten methods on the Disorder-NOX dataset. We provide a freely available web-demo for emb2dis and a source code repository for local installation. Weblink for the tool: https://sinc.unl.edu.ar/web-demo/emb2dis/ The importance of the emb2dis tool is that it provides a new deep learning approach and significant improvements in the prediction of protein disorder, with a simple web interface and graphical output detailing per-residue disorder.
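The claim that dilated convolutions enlarge the receptive field can be checked with a short calculation. For a stack of stride-1 1D convolutions, each layer adds (kernel_size - 1) × dilation positions to the receptive field. The kernel size and dilation schedule below are assumptions for illustration, not the emb2dis configuration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in residues) of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# Usage: four layers of kernel size 3, without and with doubling dilation.
print(receptive_field(3, [1, 1, 1, 1]))  # plain convolutions
print(receptive_field(3, [1, 2, 4, 8]))  # dilated convolutions
```

With the same number of layers and parameters, the dilated stack sees 31 residues per output position instead of 9, which is the extended context the abstract refers to.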
Shrimpton-Phoenix, E.; Notari, E.; Wood, C. W.
The incorporation of non-canonical amino acids (ncAAs) is a powerful strategy for introducing novel chemical functions into proteins. Molecular dynamics (MD) simulations are essential for understanding the structural and dynamic effects of these modifications, yet the creation of accurate force field parameters for ncAAs remains a significant bottleneck. Current parameterisation methods are often inaccurate or computationally expensive. To address this, we present drFrankenstein, an automated pipeline for generating AMBER force field parameters for ncAAs. drFrankenstein is a robust and accessible tool that streamlines the parameterisation workflow, enabling the routine use of MD simulations to study the behaviour of ncAA-containing proteins.
Algorta, J.; Walther, D.
Metabolic pathways are often hypothesized to benefit from the spatial organization of enzymes, facilitating substrate transfer through mechanisms such as metabolic channeling or metabolon formation. However, it remains unclear whether the spatial proximity of catalytic sites represents a general organizational principle of metabolism or is restricted to specific pathways. Here, we investigate whether consecutive enzymes in metabolic pathways, when physically interacting, exhibit structurally optimized arrangements that minimize distances between their catalytic sites, thereby increasing metabolite transfer efficiency from one enzyme to the next. We first evaluated the ability of current protein-protein interaction prediction methods, including AlphaFold2, AlphaFold3, ESMFold, and HDOCK, to model weak and transient interactions using a benchmark dataset of 112 low-affinity protein dimers from PDBbind. AlphaFold-based approaches performed best in recovering correct interaction geometries, while ESMFold showed limited performance. We further assessed several confidence metrics and identified ipTM, ipSAE, and VoroIF-GNN as the most informative predictors of correct interaction conformations. In addition to simple Euclidean distance metrics, we developed a computational procedure to estimate shortest accessible space paths between catalytic sites in predicted enzyme-enzyme complexes. Applying this framework to 107 consecutive enzyme pairs in E. coli revealed an increased tendency for consecutive enzymes to interact, but no systematic evidence that interacting enzymes position their catalytic sites in spatially optimized configurations. In the predicted complex conformations, catalytic sites tend not to be positioned closer than expected at random. The developed computational workflow provides a general framework for analyzing structural aspects of metabolic organization.
Zhang, L.; Wang, L.; Sun, X.; Tang, W.; Su, H.; Qian, Y.; Yang, Q.; Li, Q.; Tang, Z.; Sun, H.; Han, Y.; Jiang, Y.; Lou, W.; Zhou, B.; Wang, X.; Bai, L.; Xie, Z.
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
Lin, R.; Ahnert, S. E.
Protein function is fundamentally driven by structural dynamics, yet the majority of structural bioinformatics treats proteins as static rigid bodies. While Molecular Dynamics (MD) simulations attempt to capture these motions, they are computationally prohibitive for exploring large-scale conformational changes, such as domain movements or allostery, which occur on timescales often inaccessible to standard simulation. However, the Protein Data Bank (PDB) contains a latent wealth of dynamic information in the form of redundant entries: proteins solved in multiple distinct conformational states. Detecting these "shape-shifting" pairs remains challenging because standard structural alignment algorithms (e.g., TM-align) rely on rigid-body superposition, which fails when substantial geometric rearrangement occurs. In this study, we introduce a high-throughput method to systematically mine the PDB for proteins that share identical topology but exhibit divergent tertiary conformations. By utilizing a coarse-grained Secondary Structure Element (SSE) representation, we decouple topological connectivity from geometric rigidity, allowing for the detection of conformational homologues that share low global structural similarity despite high predicted structural similarity. We applied this "conformational scanning" across the entire RCSB database, identifying a curated dataset of proteins undergoing significant structural rearrangements. This work bridges the gap between static structural data and dynamic function, providing a critical "ground truth" dataset for benchmarking data-driven protein design and checking the plausibility of generative structure models.
Arab, S. S.; Lewis, N. E.
Amino acid oxidation is a major cause of protein instability and loss of function in therapeutic and industrial settings. Although methionine, cysteine, tyrosine, and tryptophan residues are widely recognized as oxidation-prone, only a subset of such residues are dominant functional hotspots, and not all are suitable targets for mutation. Identifying these vulnerable yet engineerable sites remains a major challenge. Here, we present EvoMut, a residue-level analytical framework for evaluating both oxidative vulnerability and mutation feasibility. EvoMut estimates oxidation risk by integrating structural features, local functional context, intrinsic chemical susceptibility, and evolutionary conservation. A central feature of the framework is the explicit separation of oxidation risk from mutation feasibility: candidate substitutions are evaluated only after high-risk residues are identified and ranked by evolutionary substitution patterns. Application of EvoMut to multiple proteins, and evaluation with experimental data, showed that oxidation-prone residues differ markedly in their engineering potential. EvoMut distinguishes residues that are both oxidation-sensitive and evolutionarily permissive from those that are chemically vulnerable but functionally constrained. By providing residue-level mechanistic insight, EvoMut offers a practical framework for the rational design of oxidation-resistant proteins. EvoMut is freely available as a web server at https://evomut.org. Significance Statement: Strategies to improve oxidative stability in proteins often rely on chemical intuition or solvent accessibility alone, with limited consideration of functional and evolutionary constraints. EvoMut addresses this gap by explicitly separating oxidative vulnerability from mutation feasibility and integrating structural, chemical, functional, and evolutionary information within an interpretable framework.
It helps explain why some oxidation-prone residues can be successfully engineered whereas others remain constrained, thus supporting rational decision-making in oxidative stability engineering.
Zeng, W.; Li, X.; Zou, H.; Dou, Y.; Zhao, X.; Peng, S.
Multi-objective reinforcement learning based on predicted structure feedback has been introduced into protein inverse folding. However, existing methods typically rely on a single model to optimize multiple structural objectives via a scalarized reward, which can bias the optimization toward dominant objectives and limit the exploration of diverse solutions. Here, we propose an online Symmetric Self-play Preference Optimization (SSP) framework that decouples the optimization of multiple structural objectives by training separate preference models with distinct reward signals, while enabling interaction through a shared sampling pool. This design allows the models to explore diverse optimization trajectories without enforcing a single dominant direction. Extensive experiments on both natural and de novo binder backbone inverse folding tasks demonstrate that SSP consistently improves sequence design self-consistency compared to single-model and existing baselines. Further analysis shows that different structural objectives are only partially aligned and induce distinct optimization directions, as evidenced by metric correlation and white-box analyses. This supports the effectiveness of decoupling objectives to enable higher design quality in protein design.
Kudari, Z.; Kaira, V. S.; P, S. S.; Bhat, R.; Gnana Sekaran, J.
Accurate prediction of drug-target affinity (DTA) is a core challenge in computational drug discovery. Structure-based methods depend on experimentally determined protein coordinates, which are unavailable for most drug-relevant targets. Sequence-only approaches, in turn, operate on linear residue representations and lack an explicit mechanism to encode the spatial proximity relationships that govern protein-ligand interactions. We present XAttn-DTA, a sequence-driven framework that addresses both limitations without requiring experimental structural data. Drug molecules are encoded as 2D molecular graphs via multilayer Graph Attention Networks (GATs), capturing atomic topology and bond-level chemistry. Proteins are represented as residue-level graphs constructed from ESM2-predicted contact maps, which capture inter-residue coevolutionary and structural signals embedded within the sequence. The bidirectional cross-attention fusion module projects both embeddings into a shared latent space and applies dual multi-head cross-attention. This enables ligand and protein residue environments to inform one another. On the Davis benchmark, XAttn-DTA achieves a concordance index (CI) of 0.907 and MSE of 0.175, improving CI by 1.8% and reducing MSE by 9.3% over the strongest baseline. On KIBA, it achieves an MSE of 0.121, a 13.6% reduction. Under three strict cold-start settings across Davis, KIBA, and BindingDB, the model yields MSE reductions of up to 79.0% and CI improvements of up to 31.5% over the strongest baseline, demonstrating strong generalization to unseen scaffolds and novel protein families.
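For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of the standard pairwise definition: the fraction of comparable pairs (pairs with different true affinities) whose predicted order matches the true order, with tied predictions counted as half. It is a straightforward O(n²) illustration and is not taken from the XAttn-DTA code.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs ranked in the correct order (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # equal true affinities carry no ordering information
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable

# Usage: one of three pairs is ordered incorrectly.
print(concordance_index([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))
```

A CI of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported value of 0.907 on Davis.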
Secker, C.; Secker, P.; Yergoez, F.; Celik, M. O.; Chewle, S.; Phuong Nga Le, M.; Masoud, M.; Christgau, S.; Weber, M.; Gorgulla, C.; Nigam, A.; Pollice, R.; Schuette, C.; Fackeldey, K.
The identification of suitable lead molecules in the vast chemical space is a critical and challenging task in drug discovery campaigns. Recently, it has been demonstrated that large-scale virtual screening provides a powerful approach to accelerate the identification of novel drug candidates by screening ever increasing virtual ligand libraries, which have reached magnitudes of > 10^20 compounds. However, this desirable increase in potentially bioactive molecules poses a new challenge as enumerating and virtually screening such huge compound libraries is computationally prohibitive. Consequently, advanced approaches to navigate ultra-large chemical spaces and to identify suitable candidate molecules therein are urgently needed. Here, we present an evolutionary algorithm framework using molecular generative AI, reaction-based substructure searching, and iterative model fine-tuning for a targeted and efficient exploration of chemical fragment spaces. Combining this approach with large-scale virtual screening we are able to identify target-specific candidate molecules within the commercially available Enamine REAL Space (~10^15). We demonstrate the applicability of the approach by successfully identifying and biochemically validating pH-specific ligands of the μ-opioid receptor. Our results demonstrate that integrating generative AI with evolutionary algorithms provides a promising route to explore ultra-large chemical spaces for the discovery of novel, synthetically accessible lead molecules.